This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Modifying clojure CNN text classification example #13865

Merged


kedarbellare
Contributor

Description

  • This PR modifies the Clojure CNN text classification example to bring it closer to the Python-based example and to fix some bugs.
  • It can be used with pretrained embeddings such as :glove or :word2vec. It can also be used without any pretrained embedding, in which case an embedding is learned from the training data.
  • Previously, when pretrained embeddings were used, a new random vector (uniformly sampled) was generated every time the same out-of-vocabulary (OOV) word appeared in the corpus. This PR fixes that by generating a random vector just once per OOV word. This also helps avoid the "GC limit exceeded" memory error.
  • Finally, it can be used with word2vec embeddings, which required some tweaks to the way the model is loaded. To reduce memory usage, either at most max-vectors embeddings can be loaded, or a vocab can be passed in so that only a subset of the embeddings is loaded.
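
The two memory-related ideas in the description (one random vector per OOV word, and loading only a subset of the pretrained embeddings) can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function names `random-vector`, `load-subset` and `build-embedding-table` are hypothetical, the sampling range [-0.25, 0.25) is an assumption, and the pretrained embeddings are assumed to arrive as a (possibly lazy) sequence of `[word vector]` pairs.

```clojure
;; Hypothetical helper names; plain clojure.core only, no MXNet calls.

(defn random-vector
  "Returns a vector of `dim` doubles sampled uniformly from [-0.25, 0.25).
  The range is an assumption made for this sketch."
  [dim]
  (vec (repeatedly dim #(- (* 0.5 (rand)) 0.25))))

(defn load-subset
  "Builds a word->vector map from `pairs`, a (possibly lazy) seq of
  [word vector] pairs. When `vocab` (a set) is given, only words in it are
  kept; when `max-vectors` is given, at most that many pairs are realized."
  [pairs {:keys [max-vectors vocab]}]
  (cond->> pairs
    vocab       (filter (fn [[word _]] (contains? vocab word)))
    max-vectors (take max-vectors)
    true        (into {})))

(defn build-embedding-table
  "Maps each distinct word in `words` to exactly one vector: the pretrained
  vector when available, otherwise a random vector generated once and reused
  for every later occurrence of that word."
  [pretrained words dim]
  (reduce (fn [table word]
            (if (contains? table word)
              table ; already assigned, so no new random vector is created
              (assoc table word (or (get pretrained word)
                                    (random-vector dim)))))
          {}
          words))
```

With a table built this way, embedding a sentence is a lookup such as `(mapv table tokens)`, so a repeated OOV word always resolves to the same vector rather than allocating a fresh one per occurrence.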

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Code is well-documented:
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • CNN text classification (including docs)

Reviewers

@gigasquid

@gigasquid
Member

Thanks for your work in improving this. I'm looking forward to taking it for a spin 😸


gigasquid left a comment


Fantastic! The glove trains much faster and for the first time, I was able to train word2vec on my laptop. Great job 💯 👏

(println "Finished")
{:num-embed dim :word2vec word2vec})))
(defn- load-w2v-vectors
"Lazily loads the word2vec vectors given a data input stream `dis`,

Nice refactoring

@gigasquid gigasquid merged commit 0e57930 into apache:master Jan 13, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Modifying clojure CNN text classification example

* Small fixes

* Another minor fix